Concurrent Learning of Large-Scale Random Forests

Author

  • Henrik Boström
Abstract

The random forest algorithm belongs to the class of ensemble learning methods that are embarrassingly parallel, i.e., the learning task can be straightforwardly divided into subtasks that can be solved independently by concurrent processes. A parallel version of the random forest algorithm has been implemented in Erlang, a concurrent programming language originally developed for telecommunication applications. The implementation can be used to generate very large forests, or to handle very large datasets, in a reasonable time frame. This makes it possible to investigate potential gains in predictive performance from generating large-scale forests. An empirical investigation on 34 datasets from the UCI repository shows that forests of 1000 trees significantly outperform forests of 100 trees with respect to accuracy, area under the ROC curve (AUC) and Brier score. However, increasing the forest size to 10 000 or 100 000 trees does not give any further significant performance gains.
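
The independence of the subtasks described above can be illustrated with a minimal sketch. It is not the Erlang implementation from the paper: it is written in Python, uses decision stumps in place of full trees, and uses a thread pool as a stand-in for Erlang's concurrent processes. The names `train_stump`, `train_forest` and `predict` are hypothetical and chosen only for this illustration.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(task):
    """Train one decision stump on a bootstrap sample (one independent subtask)."""
    data, seed = task
    rng = random.Random(seed)                    # per-task RNG: no shared state
    sample = [rng.choice(data) for _ in data]    # bootstrap replicate
    feature = rng.randrange(len(data[0][0]))     # random feature for this stump
    best = None
    for x, _ in sample:
        threshold = x[feature]
        left = [y for xs, y in sample if xs[feature] <= threshold]
        right = [y for xs, y in sample if xs[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # An empty right side falls back to the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def train_forest(data, n_trees, workers=4):
    """Build n_trees stumps concurrently; no task depends on any other."""
    tasks = [(data, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, tasks))


def predict(forest, x):
    """Combine the independently built stumps by majority vote."""
    votes = Counter(left if x[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]
```

Because each task carries its own seed and shares nothing with the others, the pool could be swapped for `concurrent.futures.ProcessPoolExecutor` to obtain real CPU parallelism; in CPython, threads mainly show the structure of the decomposition.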

Similar resources

Fault Locating in High Voltage Transmission Lines Based on Harmonic Components of One-end Voltage Using Random Forests

In this paper, an approach is proposed for accurate locating of single phase faults in transmission lines using voltage signals measured at one-end. In this method, harmonic components of the voltage signals are extracted through Discrete Fourier Transform (DFT) and are normalized by a transformation. The proposed fault locator, which is designed based on Random Forests (RF) algorithm, is train...

Concurrent control on resource planning and revenue/expenditure estimation in large-scale shell material embankment projects management using discrete-event simulation

Resource planning in large-scale construction projects has been a complicated management issue requiring mechanisms to facilitate decision making for managers. In the present study, a computer-aided simulation model is developed based on concurrent control of resources and revenue/expenditure. The proposed method responds to the demand of resource management and scheduling in shell material emb...

An Empirical Comparison of Supervised Learning Algorithms Using Different Performance Metrics

We present results from a large-scale empirical comparison between ten learning methods: SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We evaluate the methods on binary classification problems using nine performance criteria: accuracy, squared error, cross-entropy, ROC Area, F-score, p...

Results from a Semi-Supervised Feature Learning Competition

We present results from a recent large-scale semi-supervised feature learning competition, which attracted twenty-nine teams and 238 total submissions. The learning task was drawn from a real world task in malicious url classification. This was a large scale binary classification task, with a sparse feature space of one million features, and training data sets of 50,000 labeled examples and one...

Rapid Feature Selection Based on Random Forests for High-Dimensional Data

One of the important issues of machine learning is obtaining essential information from high-dimensional data for discrimination. Dimensionality reduction is a means to reduce the burden of dimensionality due to large-scale data. Feature selection determines significant variables and contributes to dimensionality reduction. In recent years, the random forests method has been the focus of resear...

Publication date: 2011